ABSTRACT

This work proposes V-SMART-Join, a scalable MapReduce- 
based framework for discovering all pairs of similar entities. 
The V-SMART-Join framework is applicable to sets, mul- 
tisets, and vectors. V-SMART-Join is motivated by the 
observed skew in the underlying distributions of Internet 
traffic, and is a family of 2-stage algorithms, where the first 
stage computes and joins the partial results, and the second 
stage computes the similarity exactly for all candidate pairs. 
The V-SMART-Join algorithms are very efficient and scal- 
able in the number of entities, as well as their cardinalities. 
They were up to 30 times faster than the state-of-the-art
algorithm, VCL, when compared on a real dataset of a small 
size. We also established the scalability of the proposed al- 
gorithms by running them on a dataset of a realistic size, 
on which VCL never managed to finish. Experiments were
run using real datasets of IPs and cookies, where each IP 
is represented as a multiset of cookies, and the goal is to 
discover similar IPs to identify Internet proxies. 

1. INTRODUCTION 

The recent proliferation of social networks, mobile appli- 
cations and online services increased the rate of data gath- 
ering. Such services gave birth to Internet-traffic-scale prob- 
lems that mandate new scalable solutions. Each online surfer 
contributes to the Internet traffic. Internet-traffic-scale prob- 
lems pose a scalability gap between what the data analysis 
algorithms can do and what they should do. The MapRe- 
duce [11] framework is one major shift in the programming 
paradigms proposed to fill this gap by distributing algo- 
rithms across multiple machines. 

This work proposes the V-SMART-Join (Versatile Scal- 
able MApReduce all-pair similariTy Join) framework as a 
scalable exact solution to a very timely problem, all-pair 
similarity joins of sets, multisets and vectors. This problem 
has attracted much attention recently [2, 3, 4, 5, 6, 9, 10, 
13, 22, 29, 33, 34] in the context of several applications. The 
applications include clustering documents and web content
[3, 13, 34], detecting attacks from colluding attackers [22],
refining queries and doing collaborative filtering [4], cleaning
data [2, 10], and suggesting friends in social services based 
on common interests [12]. 

The motivating application behind this work is commu- 
nity discovery, where the goal is to discover strongly con- 
nected sets of entities in a huge space of sparsely-connected 
entities. The mainstream work in the field of community dis- 
covery [20, 27, 30, 36] has assumed the relationships between 
the entities are known a priori, and has proposed clustering 
algorithms to discover communities. While the relationships 
between entities are usually volunteered by domain experts, 
like in the case of bioinformatics, or by the entities them- 
selves, like in social networks, this is not always the case. 
When information about the relationships is missing, it is 
reasonable to interpret high similarity between any two entities
as evidence of an existing relationship between them.
Hence, our focus is to discover similar pairs of entities. 

We propose using community discovery for classifying IP 
addresses as load balancing proxies. An Internet Service 
Provider (ISP) that assigns dynamic IP addresses (IPs for 
short) to its customers sends their traffic to the rest of the 
Internet via a set of proxy IPs. For advertisement targeting
and traffic anomaly detection purposes, it is crucial
to identify these load balancing proxies, and treat each set 
of load balancers as one indivisible source of traffic. For 
instance, for the application of traffic anomalies detection 
based on the source IP of the traffic [23, 24], the same 
whitelisting/blacklisting decision should be taken for all the 
IPs of an ISP load balancer. For the application of targeting 
advertisement, the IP of the surfer gets resolved to a specific 
country or city, and the ads are geographically targeted ac- 
cordingly. Some ISPs provide services in multiple locations, 
and their IPs span an area wider than the targeting granu- 
larity. No ads should be geo-targeted for the IPs of the same 
load balancer if the IPs resolve to multiple locations. 

To that end, we propose representing each IP using a mul- 
tiset, also known as a bag, of the cookies that appear with 
it, where the multiplicity of each cookie is the number of
times it appeared with the IP. Identifying IPs of a load bal- 
ancer reduces to finding all pairs of IPs with similar multi- 
sets of cookies. Representing IPs as multisets, as opposed 
to sets, makes the results more sensitive to the activities of 
the cookies, and hence increases the confidence in the re- 
sults. A post-processing step is to cluster these IPs, where 
each pair of similar IPs is connected by an edge in an IP-
similarity graph. Each cluster corresponds to the IPs of the same
load balancer. This work complements the work in [24] that
estimates the number of users behind IPs, which can also be
used for identifying large Internet proxies.

Figure 1: The MapReduce framework.

To discover all pairs of similar IPs, this work proposes V- 
SMART-Join, a scalable MapReduce based framework. The 
contributions of this work are as follows.

1. Versatility: V-SMART-Join is carefully engineered to 
work on vectors, sets, and multisets using a wide vari- 
ety of similarity measures. 

2. Speed and Scalability: V-SMART-Join employs a two-stage
approach, which achieves significant scalability
in the number of entities, as well as their cardinali- 
ties, since it does not entail loading whole entities into 
the main memory. Moreover, V-SMART-Join care- 
fully handles skewed data distributions. 

3. Wide Adoption: The proposed V-SMART-Join algo- 
rithms can be executed on the publicly available ver- 
sion of MapReduce, Hadoop [1]. 

4. Experimental Verification: On real datasets, the V- 
SMART-Join algorithms ran up to 30 times faster than 
the state-of-the-art algorithm, VCL [33].

The rest of the paper is organized as follows. The MapRe- 
duce framework is explained in § 2. In § 3, the problem is for- 
malized and an insight is presented to build distributed algo- 
rithms. This insight is based on a classification of the partial 
results necessary to calculate similarity. The V-SMART- 
Join framework is presented in § 4. The V-SMART-Join 
algorithms are presented in § 5. The related work is re- 
viewed in § 6. The experimental evaluation is reported in 
§ 7, and we conclude in § 8. 

2. THE MAPREDUCE FRAMEWORK 

The MapReduce framework was introduced in [11] to fa- 
cilitate crunching huge datasets on shared-nothing clusters 
of commodity machines. The framework tweaks the map 
and reduce primitives widely used in functional program- 
ming and applies them in a distributed computing setting. 

Each record in the input dataset is represented as a tuple
$\langle key_1, value_1 \rangle$. The first stage is to partition the input
dataset, typically stored in a distributed file system, such as
GFS [14], among the machines that execute the map functionality,
the mappers. In the second stage, each mapper
applies the map function on each single record to produce
a list of the form $\langle\langle key_2, value_2 \rangle\rangle^*$, where $\langle . \rangle^*$ represents
lists of length zero or more. The third stage is to shuffle the
output of the mappers into the machines that execute the
reduce functionality, the reducers. This is done by grouping
the mappers' output by the key, and producing a
reduce_value_list of all the $value_2$'s sharing the same value of
$key_2$. In addition to $key_2$, the mapper can optionally output
tuples by a secondary key. Each reducer would then
receive the reduce_value_list sorted by the secondary key.
Secondary keys are not supported by the publicly available
version of MapReduce, Hadoop [1]^1. The input to the reducer
is typically tuples of the form $\langle key_2, \langle value_2 \rangle^* \rangle$. For
notational purposes, the reduce_value_list of key $k$ is denoted
reduce_value_list$_k$. In the fifth stage, each reducer
applies the reduce function on the $\langle key_2, \langle value_2 \rangle^* \rangle$ tuple to
produce a list of values, $\langle value_3 \rangle^*$. Finally, the output of
the reducers is written to the distributed file system. The
framework is depicted in Figure 1.

MapReduce became the de facto distributed paradigm for 
processing huge datasets because it disburdens the program- 
mer of details like partitioning the input dataset, scheduling 
the program across machines, handling failures, and manag- 
ing inter-machine communication. Only the map and reduce 
functions on the forms below need to be implemented. 

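map:

$\langle key_1, value_1 \rangle \rightarrow \langle\langle key_2, value_2 \rangle\rangle^*$

reduce:

$\langle key_2, \langle value_2 \rangle^* \rangle \rightarrow \langle value_3 \rangle^*$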
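
As a minimal single-machine sketch of the map and reduce forms above, the following Python simulates the map, shuffle, and reduce stages in memory. The helper name `run_mapreduce` and the word-count example are illustrative assumptions, not part of any MapReduce implementation.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Simulate map -> shuffle (group by key2) -> reduce on one machine."""
    groups = defaultdict(list)
    for key1, value1 in records:                  # map stage
        for key2, value2 in map_fn(key1, value1):
            groups[key2].append(value2)           # shuffle: build reduce_value_list
    output = []
    for key2 in sorted(groups):                   # reduce stage, one key at a time
        output.extend(reduce_fn(key2, groups[key2]))
    return output

def count_map(doc_id, text):
    # <key1, value1> -> <<key2, value2>>*
    for word in text.split():
        yield word, 1

def count_reduce(word, ones):
    # <key2, <value2>*> -> <value3>*
    yield word, sum(ones)

records = [("d1", "a b a"), ("d2", "b c")]
print(run_mapreduce(records, count_map, count_reduce))
```

A real deployment distributes each stage across machines and never holds all groups in one dictionary; the sketch only mirrors the data flow.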

For better fault tolerance, the map and reduce functions 
are required to be pure and deterministic. For higher effi- 
ciency, the same machines used for storing the input can be 
used as mappers to reduce the network load. In addition, 
partial reducing can happen at the mappers, which is known 
as combining. The combine function is typically the same 
as the reduce function. While combining does not increase 
the power of the framework, it reduces the network load^2.

The amount of information that needs to fit in the memory
of each machine is a function of the algorithm and the input
and output tuples. In terms of the input and output tuples,
during the map stage, at any time, the memory needs to
accommodate one instance of each of the tuples $\langle key_1, value_1 \rangle$
and $\langle key_2, value_2 \rangle$. Similarly, during the reduce stage, the
memory needs to accommodate one instance of each of $key_2$,
$value_2$ and $value_3$. Nevertheless, accommodating multiple
values of $\langle key_1, value_1 \rangle$, $\langle key_2, value_2 \rangle$ or $value_3$ allows for
I/O buffering. Accommodating the entire reduce_value_list
in memory allows for in-memory reduction.

For more flexibility, the MapReduce framework also allows
for loading external data both when mapping and reducing.
However, to preserve the determinism and purity of
the map and reduce functions, loading is allowed only at the
beginning of each stage. Moreover, the types of $key_1$, $key_2$,
$value_1$, $value_2$ and $value_3$ are independent^3.

^1 Two ways to support secondary keys were proposed in [21].
One of them is not scalable, since it entails loading the entire
reduce_value_list in the memory of the reducer, and the
second solution entails rewriting the partitioner, the MapReduce
component that assigns instances of $key_2$ to reducers.
The second solution was adopted on the web page of [1]. We
propose algorithms that avoid this limitation.

^2 Combiners can be either dedicated functions or part of the
map functions. A dedicated combiner operates on the output
of the mapper. Dedicated combiners involve instantiation
and destruction. On the other hand, an on-mapper-combiner
is part of the mapper, is lightweight, but may involve
fitting all the keys the mapper observes in memory,
which can result in thrashing. This is discussed in detail in
[21]. We used dedicated combiners for higher scalability.

This framework, albeit simple, is powerful enough to serve 
as the foundation for an array of platforms. Examples in- 
clude systems that support issuing SQL(-like) queries that 
get translated to MapReduce primitives and get executed 
in a distributed environment [25, 35, 32]. Another relevant 
example is adapting stream analysis algorithms to a distributed
setting by the Sawzall system [26].

It is difficult to analyze the complexity of a MapReduce- 
based algorithm due to several factors, including the overlap 
between mappers, shufflers and reducers, the use of combiners,
and the high I/O and communication cost as compared to
the processing cost. However, to the best of our abilities, we 
will try to identify the bottlenecks throughout the sequel. 

Having described the necessary background, the insight 
for scalable MapReduce-based algorithms is described next. 

3. PROBLEM FORMALIZATION AND 
INSIGHTS 

We start with the formalization, and then use it to present
the insight for more scalable solutions. 

3.1 Formalizing the Problem 

Given a set, $S$, of multisets, $M_1, \ldots, M_{|S|}$, on the alphabet
$A = \{a_1, \ldots, a_{|A|}\}$, find all pairs of multisets, $\langle M_i, M_j \rangle$, such
that their similarity, $Sim(M_i, M_j)$, exceeds some threshold,
$t$. The similarity measure, $Sim(., .)$, is assumed to be commutative.
A multiset, identified by $M_i$, is represented as
$M_i = \langle A, A \rightarrow \mathbb{N} \rangle = \{m_{i,1}, \ldots, m_{i,|A|}\}$, where $m_{i,k}$
represents the element in multiset $M_i$ that has the alphabet
element $a_k$. More formally, $m_{i,k} = \langle a_k, f_{i,k} \rangle$ and $f_{i,k} \in \mathbb{N}$ is
the multiplicity of $a_k$ in $M_i$. The cardinality of $M_i$ is denoted
$|M_i| = \sum_{1 \le k \le |A|} f_{i,k}$. The set of alphabet elements that are
present in $M_i$ is called its underlying set, $U(M_i)$. That is,
$U(M_i) = \{a_k : f_{i,k} > 0\}$. Hence, $U(M_i) = \langle A, A \rightarrow \{0, 1\} \rangle$.
The underlying cardinality of $M_i$ is the number of unique
elements present in $M_i$, i.e., $|U(M_i)| = |\{a_k : f_{i,k} > 0\}|$ [31].
The frequency of an element, $a_k$, denoted $Freq(a_k)$, is the
number of multisets $a_k$ belongs to.

Representing multisets as non-negative vectors is trivial if 
A is totally ordered. The semantics of sets can also be used 
to represent the more general notion of multisets. A multiset 
can be represented as a set by expanding each element $m_{i,k}$
into the elements $\{\langle \langle a_k, j \rangle, 1 \rangle\}$, for $1 \le j \le f_{i,k}$ [10]. In the
sequel, the focus is on multisets, but the formalization and 
algorithms can be applied to sets and vectors. 
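
To make the notation concrete, the following Python sketch represents multisets as `collections.Counter` objects and computes the quantities defined above. The toy dataset `S` and all function names are hypothetical; the set expansion keeps only the $\langle a_k, j \rangle$ part of each expanded element.

```python
from collections import Counter

# Hypothetical toy dataset: multiset id -> {alphabet element: multiplicity}.
S = {
    "M1": Counter({"a": 2, "b": 1}),
    "M2": Counter({"a": 1, "c": 3}),
}

def cardinality(m):
    """|M_i|: the sum of the multiplicities f_{i,k}."""
    return sum(m.values())

def underlying_set(m):
    """U(M_i): the alphabet elements with f_{i,k} > 0."""
    return {a for a, f in m.items() if f > 0}

def freq(a, dataset):
    """Freq(a_k): the number of multisets that contain a_k."""
    return sum(1 for m in dataset.values() if m[a] > 0)

def as_set(m):
    """Set view of a multiset: one (a_k, j) per 1 <= j <= f_{i,k}."""
    return {(a, j) for a, f in m.items() for j in range(1, f + 1)}

print(cardinality(S["M1"]), underlying_set(S["M1"]), freq("a", S))
```

Under this set view, the size of `as_set(m)` equals `cardinality(m)`, which is why set measures carry over to multisets.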

Since this work focuses only on sets, multisets, and vec- 
tors, we only consider the similarity measures that exhibit 
the Shuffling Invariant Property (SIP). A measure exhibit- 
ing SIP is agnostic to the order of the elements in the al- 
phabet A. Hence, shuffling the alphabet does not impact 
the similarity between multisets. For measures exhibiting 
SIP, the term Nominal Similarity Measures (NSMs) was 
coined in [8]^4. All the set, multiset, and vector similarity
measures handled in the literature we are aware of
are NSMs. For instance, the Jaccard similarity of two sets,
$S_i$ and $S_j$, is given by $\frac{|S_i \cap S_j|}{|S_i \cup S_j|}$. The Ruzicka similarity [7]
is the generalization of the Jaccard similarity to multisets.
For any two multisets, $|M_i \cap M_j| = \sum_A \min(f_{i,k}, f_{j,k})$, and
$|M_i \cup M_j| = \sum_A \max(f_{i,k}, f_{j,k})$. The set Dice similarity is
given by $2 \times \frac{|S_i \cap S_j|}{|S_i| + |S_j|}$, and the set cosine similarity is given
by $\frac{|S_i \cap S_j|}{\sqrt{|S_i| \times |S_j|}}$. Both Dice and cosine similarity can be trivially
generalized to multisets using the set representation of
multisets in [10]. The vector cosine similarity is given by
$\frac{|M_i \times M_j|}{\|M_i\|_2 \times \|M_j\|_2}$. All these measures are agnostic to the order
of the alphabet, and hence can be computed from partial
results aggregated over the entire alphabet. More formally,
NSMs can be expressed in the form of eqn. 1.

^3 Hadoop supports having different types for keys of the
reducer input and output. The Google MapReduce does not.

^4 Similarity measures are surveyed in [7, 8, 15].

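$$
Sim(M_i, M_j) = F\Big(\prod\nolimits_1^{A}\big(g_1(f_{i,k}, f_{j,k})\big),\;
\prod\nolimits_2^{A}\big(g_2(f_{i,k}, f_{j,k})\big),\; \ldots,\;
\prod\nolimits_L^{A}\big(g_L(f_{i,k}, f_{j,k})\big)\Big) \qquad (1)
$$

In eqn. 1, the $F()$ function combines the partial results of
the $g_l(., .)$ functions as aggregated over the alphabet by the
$\prod_l^{A}$ aggregators, where $1 \le l \le L$, for some constant $L$.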
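
The structure of eqn. 1 can be sketched in Python as a generic evaluator that aggregates each $g_l$ over the alphabet and then applies $F$. The function `nsm` and its parameter names are assumptions made for illustration; multisets are plain dictionaries from elements to multiplicities.

```python
def nsm(mi, mj, alphabet, gs, aggs, F):
    """Evaluate eqn. 1: aggregate each g_l over the alphabet, then combine with F."""
    partials = [
        agg(g(mi.get(a, 0), mj.get(a, 0)) for a in alphabet)
        for g, agg in zip(gs, aggs)
    ]
    return F(*partials)

mi = {"a": 2, "b": 1}
mj = {"a": 1, "c": 3}
alphabet = ["a", "b", "c"]

# Ruzicka: g_1 = min, g_2 = max, both aggregated with a sum, F = ratio.
ruzicka = nsm(mi, mj, alphabet,
              gs=[min, max], aggs=[sum, sum],
              F=lambda inter, union: inter / union)
print(ruzicka)
```

This naive evaluator scans the entire alphabet for every pair, which is exactly the cost the classification in the next subsection avoids.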

3.2 Insight for High Scalability 

The entire alphabet does not need to be scanned to com- 
pute the partial results combined using F(). We classify 
the gi(., .) functions into three classes depending on which 
elements need to be scanned to compute the partial results. 

The first unilateral class comprises functions whose partial
results can be computed using a scan on the elements in only
one multiset, either $U(M_i)$ or $U(M_j)$. Unilateral functions
consistently disregard either $f_{i,k}$ or $f_{j,k}$. For instance, to
compute the partial result $|M_i|$, $g_l(., .)$ is set to the identity
of the first operand, and $\prod_l^{A}$ to the $\sum$ aggregator.
Scanning only the elements in $U(M_i)$, instead of the entire
$A$, and applying the formula $\sum_{a_k \in U(M_i)} f_{i,k}$ yields $|M_i|$.

The second class of conjunctive functions can be computed
using a scan on the elements in the intersection of the two
multisets, $U(M_i \cap M_j)$. For instance, to compute the partial
result $|M_i \times M_j|$, $g_l(., .)$ is set to the multiplication function,
and $\prod_l^{A}$ to the $\sum$ aggregator. Scanning only the elements
in $U(M_i \cap M_j)$, instead of the entire $A$, and applying the
formula $\sum_{a_k \in U(M_i \cap M_j)} f_{i,k} \times f_{j,k}$ yields $|M_i \times M_j|$.

Similarly, we define the class of disjunctive functions for
those whose partial results can only be computed using a
scan on the elements in the union of the two multisets,
$U(M_i \cup M_j)$. For instance, to compute the symmetric difference,
$|M_i \Delta M_j|$, $g_l(., .)$ is set to the absolute of the difference,
and $\prod_l^{A}$ to the $\sum$ aggregator. Scanning only the elements
in $U(M_i \cup M_j)$, instead of the entire $A$, and applying the
formula $\sum_{a_k \in U(M_i \cup M_j)} |f_{i,k} - f_{j,k}|$ yields $|M_i \Delta M_j|$.
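
The three classes differ only in which elements are scanned. The following Python sketch, with hypothetical function names and dictionary-based multisets, computes one partial result of each class.

```python
def unilateral_cardinality(mi):
    """|M_i|: scan U(M_i) only."""
    return sum(mi.values())

def conjunctive_product(mi, mj):
    """|M_i x M_j|: scan only the elements in U(M_i) and U(M_j)."""
    shared = mi.keys() & mj.keys()
    return sum(mi[a] * mj[a] for a in shared)

def disjunctive_symmetric_difference(mi, mj):
    """|M_i delta M_j|: scan the elements in U(M_i) or U(M_j)."""
    either = mi.keys() | mj.keys()
    return sum(abs(mi.get(a, 0) - mj.get(a, 0)) for a in either)

mi = {"a": 2, "b": 1}
mj = {"a": 1, "c": 3}
print(unilateral_cardinality(mi),
      conjunctive_product(mi, mj),
      disjunctive_symmetric_difference(mi, mj))
```

Only the disjunctive scan needs both multisets in full at the same time, which is what makes that class expensive to distribute.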

Given this classification of functions, it is crucial to ex- 
amine the complexity of accumulating the partial results of 
each of these classes. The partial results of the unilateral 
functions, denoted Uni(Mi) for multiset Mi, can be accu- 
mulated for all multisets in a single scan on the dataset. The 
conjunctive partial results, denoted Conj(Mi, Mj), can be 
accumulated for all pairs of multisets in a single scan on an
inverted index of the elements^5. To compute the disjunctive
partial results, for every pair of multisets that are candidates 
to be similar, their data needs to be scanned concurrently. 
Fortunately, all the similarity measures we are aware of can 
be expressed in terms of unilateral and conjunctive func- 
tions. We leave disjunctive functions for future work. All 
the published algorithms we are aware of, reviewed in § 6, 
cannot handle disjunctive functions in the general case, since
they generate candidate pairs from inverted indexes. 

Some examples are given on expressing the widely used
similarity measures in terms of unilateral and conjunctive
functions. The Ruzicka similarity is given by $\frac{|M_i \cap M_j|}{|M_i \cup M_j|}$. Hence,
the Ruzicka similarity is expressed in the form of eqn. 1 when
$g_1(., .)$ is the $\min(., .)$ function, $g_2(., .)$ is the $\max(., .)$ function,
both $\prod_1^{A}$ and $\prod_2^{A}$ are the $\sum$ aggregator, and $Sim(M_i, M_j)$ is
$\frac{\sum_A g_1(f_{i,k}, f_{j,k})}{\sum_A g_2(f_{i,k}, f_{j,k})}$.
Notice that the denominator contains the
disjunctive function, $\max(., .)$. Ruzicka can be rewritten as
$\frac{|M_i \cap M_j|}{|M_i| + |M_j| - |M_i \cap M_j|}$,
which is expressible in the form of eqn. 1 as
$\frac{\sum_A g_1(f_{i,k}, f_{j,k})}{\sum_A g_2(f_{i,k}, f_{j,k}) + \sum_A g_3(f_{i,k}, f_{j,k}) - \sum_A g_1(f_{i,k}, f_{j,k})}$,
where $g_1(., .)$ is the $\min(., .)$ function, $g_2(., .)$ and $g_3(., .)$ are the
identity of the first and second operand, respectively, and
$\prod_1^{A}$, $\prod_2^{A}$ and $\prod_3^{A}$ are all $\sum$ aggregators. In this example,
$Uni(M_i) = (|M_i|) = (\sum_A g_2(f_{i,k}, f_{j,k}))$. Similarly,
$Uni(M_j) = (|M_j|) = (\sum_A g_3(f_{i,k}, f_{j,k}))$. Finally,
$Conj(M_i, M_j) = (|M_i \cap M_j|) = (\sum_A g_1(f_{i,k}, f_{j,k}))$.
Similarly, the multiset cosine similarity,
$\frac{|M_i \cap M_j|}{\sqrt{|M_i| \times |M_j|}}$,
and the multiset Dice similarity,
$2 \times \frac{|M_i \cap M_j|}{|M_i| + |M_j|}$,
are expressed in the form of eqn. 1 by setting $g_1(., .)$ to the
$\min(., .)$ function, $g_2(., .)$ and $g_3(., .)$ to the identity of the
first and second operands, respectively, and setting the similarity
function to
$\frac{\sum_A g_1(f_{i,k}, f_{j,k})}{\sqrt{\sum_A g_2(f_{i,k}, f_{j,k}) \times \sum_A g_3(f_{i,k}, f_{j,k})}}$
for cosine, and
$2 \times \frac{\sum_A g_1(f_{i,k}, f_{j,k})}{\sum_A g_2(f_{i,k}, f_{j,k}) + \sum_A g_3(f_{i,k}, f_{j,k})}$
for Dice.
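
The equivalence of the two Ruzicka formulations can be checked numerically. The Python sketch below uses hypothetical function names and assumes non-empty multisets stored as dictionaries; the first function uses the disjunctive $\max$, the second uses only unilateral and conjunctive partial results.

```python
def ruzicka_min_max(mi, mj):
    """Ruzicka via sum of min over sum of max (needs a scan of the union)."""
    keys = mi.keys() | mj.keys()
    inter = sum(min(mi.get(a, 0), mj.get(a, 0)) for a in keys)
    union = sum(max(mi.get(a, 0), mj.get(a, 0)) for a in keys)
    return inter / union

def ruzicka_uni_conj(mi, mj):
    """Ruzicka via Uni(M_i), Uni(M_j) and Conj(M_i, M_j) only."""
    uni_i = sum(mi.values())                                      # unilateral
    uni_j = sum(mj.values())                                      # unilateral
    conj = sum(min(mi[a], mj[a]) for a in mi.keys() & mj.keys())  # conjunctive
    return conj / (uni_i + uni_j - conj)

mi = {"a": 2, "b": 1}
mj = {"a": 1, "c": 3}
print(ruzicka_min_max(mi, mj), ruzicka_uni_conj(mi, mj))
```

The second form never touches elements outside the intersection once the cardinalities are known, which is the property the framework relies on.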

Given the above classification, in one pass over the dataset, 
the unilateral partial results, Uni(Mi), can be accumulated 
for each Mi, and an inverted index can also be built. The 
inverted index can then be scanned to compute the conjunc- 
tive partial results, Conj(Mi, Mj), for each candidate pair, 
(Mi,Mj), whose intersection is non-empty. The challenge 
is to join the unilateral partial results to the conjunctive 
partial results in order to compute the similarities. 

4. THE V-SMART-JOIN FRAMEWORK 

Instead of doing the join, the V-SMART-Join framework 
works around this scalability limitation. The general idea 
is to join Uni(Mi) to all the elements in U(Mi). Then, an 
inverted index is built on the elements in A, such that each 
entry of an element, $a_k$, has all the multisets containing $a_k$,
augmented with their Uni(.) partial results. For each pair of
multisets sharing an element, $\langle M_i, M_j \rangle$, this inverted index
contains Uni(Mi) and Uni(Mj). The inverted index can also 
be used to compute the Conj(Mi, Mj). Hence, the inverted 
index can be used to compute Sim(Mi, Mj) for all pairs. 

The V-SMART-Join framework consists of two phases.
The first joining phase joins Uni(Mi) to all the elements 
in U(Mi). The second similarity phase builds the inverted 
index, and computes the similarity between all candidate 
pairs. The algorithms of the joining phase are described in 



§ 5. In this section, the focus is on the similarity phase, 
since it is shared by all the joining algorithms. 

Each multiset, Mi, is represented in the dataset input to 
the similarity phase using multiple tuples, a tuple for each 
$a_k$, where $a_k \in M_i$. We call these input tuples of the form
$\langle M_i, Uni(M_i), m_{i,k} \rangle$ joined tuples. This representation of
the input data is purposeful. If each multiset is represented 
as one tuple, multisets with vast underlying cardinalities 
would cause scalability and load balancing problems. 

The V-SMART-Join similarity phase is scalable, and comprises
two MapReduce steps. The goal of the first step,
Similarity$_1$, is to build the inverted index augmented with
the Uni(.) values, and scan the index to generate candidate
pairs. The map stage transforms each entry of $m_{i,k}$ to be
indexed by the element $a_k$, and carries down $Uni(M_i)$ and
$f_{i,k}$ to the output tuple. The shuffler groups together all the
tuples by their common elements. This implicitly builds an
inverted index on the elements, such that the list of each element,
$a_k$, is augmented with $Uni(M_i)$ and $f_{i,k}$ for each set
$M_i$ containing $a_k$. For each element, $a_k$, a reducer receives
a reduce_value_list$_{a_k}$. For each pair of multisets, $\langle M_i, M_j \rangle$,
in reduce_value_list$_{a_k}$, the reducer outputs the identifiers,
$\langle M_i, M_j \rangle$, along with $Uni(M_i)$, $Uni(M_j)$, $f_{i,k}$ and $f_{j,k}$. The
map and reduce functions are formalized below.

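$map_{Similarity_1}$:

$\langle M_i, Uni(M_i), m_{i,k} \rangle \rightarrow \langle a_k, \langle M_i, Uni(M_i), f_{i,k} \rangle \rangle$

$reduce_{Similarity_1}$:

$\langle a_k, \langle \langle M_i, Uni(M_i), f_{i,k} \rangle \rangle^* \rangle \xrightarrow{\forall M_i, M_j \in \text{reduce\_value\_list}} \langle \langle \langle M_i, M_j, Uni(M_i), Uni(M_j) \rangle, \langle f_{i,k}, f_{j,k} \rangle \rangle \rangle^*$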

The second step, Similarity$_2$, computes the similarity from
the inverted index. It employs an identity map stage. A reducer
receives reduce_value_list$_{\langle M_i, M_j \rangle}$ containing $\langle f_{i,k}, f_{j,k} \rangle$
for each common element, $a_k$, of a pair $\langle M_i, M_j \rangle$. The key of
the list is augmented with $Uni(M_i)$ and $Uni(M_j)$. Therefore,
Similarity$_2$ can compute $Conj(M_i, M_j)$, and combine
it with $Uni(M_i)$ and $Uni(M_j)$ using $F()$. The result would
be $Sim(M_i, M_j)$. Since computing the similarity of pairs
of multisets with large intersections entails aggregation over
long lists of $\langle f_{i,k}, f_{j,k} \rangle$ values, the lists are pre-aggregated
using combiners to better balance the reducers' load. The
map and reduce functions are formalized below.

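$map_{Similarity_2}$:

$\langle \langle M_i, M_j, Uni(M_i), Uni(M_j) \rangle, \langle f_{i,k}, f_{j,k} \rangle \rangle \rightarrow \langle \langle M_i, M_j, Uni(M_i), Uni(M_j) \rangle, \langle f_{i,k}, f_{j,k} \rangle \rangle$

$reduce_{Similarity_2}$:

$\langle \langle M_i, M_j, Uni(M_i), Uni(M_j) \rangle, \langle \langle f_{i,k}, f_{j,k} \rangle \rangle^* \rangle \rightarrow \langle M_i, M_j, Sim(M_i, M_j) \rangle$

^5 An inverted index groups all the multisets containing any
specific element together.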

Clearly, the performance of the similarity phase is little 
affected by changing the similarity measure, as long as the 
same $g_l(., .)$ functions are used. That is, the impact of individual
$g_l(., .)$ functions on the final similarity values does
not affect the efficiency of the similarity phase. 
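
The two steps of the similarity phase can be sketched on a single machine as follows. This is an illustrative Python simulation, not the distributed implementation: it assumes $Uni(M_i) = |M_i|$, uses $\min$ as the only conjunctive function, and the names `similarity1`, `similarity2`, `ruzicka`, and `dice` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Joined tuples (M_i, Uni(M_i), (a_k, f_ik)); here Uni(M_i) = |M_i|.
joined = [
    ("M1", 3, ("a", 2)), ("M1", 3, ("b", 1)),
    ("M2", 4, ("a", 1)), ("M2", 4, ("c", 3)),
    ("M3", 2, ("b", 2)),
]

def similarity1(joined_tuples):
    """Build the inverted index and emit one record per pair sharing an element."""
    index = defaultdict(list)
    for m, uni, (a, f) in joined_tuples:          # map: re-key by element a_k
        index[a].append((m, uni, f))              # shuffle: group by a_k
    for entries in index.values():                # reduce: all pairs per element
        for (mi, ui, fi), (mj, uj, fj) in combinations(sorted(entries), 2):
            yield (mi, mj, ui, uj), (fi, fj)

def similarity2(pair_records, F):
    """Group by pair, aggregate Conj = sum of min(f_ik, f_jk), then apply F."""
    conj = defaultdict(int)
    for key, (fi, fj) in pair_records:
        conj[key] += min(fi, fj)
    return {(mi, mj): F(ui, uj, c) for (mi, mj, ui, uj), c in conj.items()}

def ruzicka(ui, uj, c):
    return c / (ui + uj - c)

def dice(ui, uj, c):
    return 2 * c / (ui + uj)

sims = similarity2(similarity1(joined), ruzicka)
print(sims)
```

Swapping `ruzicka` for `dice` changes only the final combination step, which mirrors the observation that the measure has little effect on the cost of the phase.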

The slowest Similarity$_1$ machine is the reducer that handles
the longest reduce_value_list$_{a_k}$. The I/O time of this
reducer is quadratic in $\max(Freq(a_k))$, the length of the longest
reduce_value_list$_{a_k}$. The longest reduce_value_list$_{a_k}$ also has
to fit in memory to output the pairwise tuples, which may
cause thrashing. The slowest Similarity$_2$ machine is the reducer
that handles the longest intersection of all pairs of
multisets. This Similarity$_2$ slowness is largely mitigated by
using combiners, while the Similarity$_1$ slowness is not.

To speed up the slowest Similarity$_1$ reducer and avoid
thrashing, elements whose frequency exceeds $q$, i.e., shared
by more than $q$ multisets, for some relatively large $q$, can be
discarded. These are commonly known as "stop words".
Discarding stop words achieves better load balancing, is
widely used in IR [5, 6, 13, 22, 29], and reduces the noise
in the similarities when the elements have skewed frequencies,
which is typical of Internet-traffic-scale applications.
This can be done in a preprocessing MapReduce step. The
preprocessing step maps input tuples from $\langle M_i, m_{i,k} \rangle$ to
$\langle a_k, \langle M_i, f_{i,k} \rangle \rangle$. The preprocessing reducer buffers the first
$q$ multisets in the reduce_value_list of $a_k$ and checks if the
list was exhausted before outputting any $\langle M_i, m_{i,k} \rangle$ tuples.
This way, the complexity of the slowest Similarity$_1$ reducer
becomes quadratic in $q$ instead of $\max(Freq(a_k))$.
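
The buffering logic of the preprocessing reducer can be sketched in Python as below. The function name `filter_stop_words` and the tuple layout are assumptions; the point is that at most $q$ entries are ever held in memory, regardless of the element's frequency.

```python
from itertools import islice

def filter_stop_words(element, value_list, q):
    """Emit (M_i, m_ik) tuples only if the element is shared by at most q multisets."""
    it = iter(value_list)
    buffered = list(islice(it, q))        # buffer the first q (M_i, f_ik) entries
    if next(it, None) is not None:        # list not exhausted: a stop word, drop it
        return []
    return [(m, (element, f)) for m, f in buffered]

entries = [("M1", 2), ("M2", 1)]
print(filter_stop_words("a", entries, q=2))   # kept
print(filter_stop_words("a", entries, q=1))   # discarded as a stop word
```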

To avoid discarding stop words, avoid thrashing and still
achieve high load balancing, the quadratic processing can be
delegated from an overloaded Similarity$_1$ reducer to several
Similarity$_2$ mappers. Each overloaded reducer can dissect
its reduce_value_list into chunks of multisets, and output all
possible pairs of chunks. Each pair of these chunks is read
by a Similarity$_2$ mapper that would output all the possible
pairs of the multisets in this pair of chunks.

To achieve this, the reducers have to make use of the capability
of rewinding their reduce_value_lists. A Similarity$_1$
reducer that receives an extremely long reduce_value_list can
dissect this list into $T$ large chunks, such that each chunk
consumes less than $\frac{B}{2}$ bytes, where $B$ is the available memory
per machine, for some $T$. Each chunk is of the form
$\langle a_k, \langle \langle M_i, Uni(M_i), f_{i,k} \rangle \rangle^* \rangle$. The reducer outputs all the
possible $T^2$ pairs of chunks in a nested loop manner, which
entails rewinding the input $T$ times. The output of such a
reducer will be different from the other normal Similarity$_1$
reducers, and can be signaled using a special flag.

These $T^2$ pairs of chunks can fit in memory and can be
processed by up to $T^2$ different Similarity$_2$ mappers. Instead
of acting as identity mappers, the Similarity$_2$ mappers
process their input in a way similar to the normal Similarity$_1$
reducers when receiving pairs of chunks, $\langle Chunk_p, Chunk_q \rangle$,
where $1 \le p, q \le T$. That is, when the input is of the form
$\langle \langle a_k, \langle \langle M_i, Uni(M_i), f_{i,k} \rangle \rangle^* \rangle, \langle a_k, \langle \langle M_j, Uni(M_j), f_{j,k} \rangle \rangle^* \rangle \rangle$, it
outputs $\langle \langle M_i, M_j, Uni(M_i), Uni(M_j) \rangle, \langle f_{i,k}, f_{j,k} \rangle \rangle$ for each
$M_i \in Chunk_p$, and each $M_j \in Chunk_q$. This better balances
the load among the Similarity$_1$ reducers while not
skewing the load among the Similarity$_2$ mappers, without
discarding stop words. In addition, the I/O cost of the
slowest Similarity$_1$ reducer becomes proportional to
$T \times \max(Freq(a_k))$ instead of $\max(Freq(a_k))^2$.
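
A rough Python sketch of this delegation follows. It assumes the list fits in a Python list (a real reducer would rewind a stream and size chunks by memory), and it keeps each unordered pair once via an ordering test on the identifiers; that deduplication rule and the names `chunk_pairs` and `expand_pair` are assumptions made for the illustration.

```python
def chunk_pairs(value_list, T):
    """Overloaded Similarity_1 reducer: split into about T chunks, emit all chunk pairs."""
    size = max(1, -(-len(value_list) // T))               # ceiling division
    chunks = [value_list[p:p + size] for p in range(0, len(value_list), size)]
    return [(cp, cq) for cp in chunks for cq in chunks]   # nested loop over chunks

def expand_pair(chunk_p, chunk_q):
    """Similarity_2 mapper: emit pair records for one pair of chunks."""
    for mi, ui, fi in chunk_p:
        for mj, uj, fj in chunk_q:
            if mi < mj:                                   # assumption: one record per pair
                yield (mi, mj, ui, uj), (fi, fj)

entries = [("M1", 3, 2), ("M2", 4, 1), ("M3", 2, 5), ("M4", 6, 1)]
pairs = chunk_pairs(entries, T=2)
records = [r for cp, cq in pairs for r in expand_pair(cp, cq)]
print(len(pairs), len(records))
```

Each call to `expand_pair` touches only two chunks, so the quadratic work is spread over up to $T^2$ independent mappers.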

5. THE JOINING PHASE ALGORITHMS 

This section describes the joining algorithms that, for each 
$M_i$, join $Uni(M_i)$ to its elements. In other words, it transforms
the raw input tuples of the form $\langle M_i, m_{i,k} \rangle$ to joined
tuples of the form $\langle M_i, Uni(M_i), m_{i,k} \rangle$.

5.1 The Online-Aggregation Algorithm

For each input tuple, the mapper outputs the information
necessary to compute $Uni(M_i)$ with secondary key 0, as well
as the same exact input tuple with secondary key 1. For each
multiset $M_i$, a reducer receives reduce_value_list$_{M_i}$ with the
output of the mappers sorted by the secondary key. The
reducer scans reduce_value_list$_{M_i}$, and computes $Uni(M_i)$,
since the information for this computation, secondary keyed
by 0, comes first in reduce_value_list$_{M_i}$. The reducer then
continues to scan the elements, secondary keyed by 1, and
outputs the multiset id, $M_i$, with the computed partial result,
$Uni(M_i)$, with each element $m_{i,k}$. The map and reduce
functions are formalized below.

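$map_{Online\text{-}Aggregation_1}$:

$\langle M_i, m_{i,k} \rangle \xrightarrow{\text{if } f_{i,k} > 0} \langle M_i, 0, f_{i,k} \rangle, \langle M_i, 1, m_{i,k} \rangle$

$reduce_{Online\text{-}Aggregation_1}$:

$\langle M_i, \langle 0, \langle f_{i,k} \rangle^* \rangle, \langle 1, \langle m_{i,k} \rangle^* \rangle \rangle \rightarrow \langle \langle M_i, Uni(M_i), m_{i,k} \rangle \rangle^*$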
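
The role of the secondary key can be illustrated with a small in-memory Python sketch. It assumes $Uni(M_i) = |M_i|$ and emulates the shuffler's secondary sort with `sorted`; the function name `online_aggregation` is hypothetical.

```python
from collections import defaultdict

def online_aggregation(raw_tuples):
    """Join Uni(M_i) = |M_i| to every element, relying on secondary-key order."""
    groups = defaultdict(list)
    for m, (a, f) in raw_tuples:                 # map
        if f > 0:
            groups[m].append((0, f))             # secondary key 0: input to Uni(M_i)
            groups[m].append((1, (a, f)))        # secondary key 1: the element itself
    for m, values in groups.items():             # reduce, one multiset at a time
        uni = 0
        for secondary, payload in sorted(values, key=lambda v: v[0]):
            if secondary == 0:
                uni += payload                   # still accumulating Uni(M_i)
            else:
                yield m, uni, payload            # Uni(M_i) is complete by now

raw = [("M1", ("a", 2)), ("M1", ("b", 1)), ("M2", ("c", 3))]
print(list(online_aggregation(raw)))
```

Because all entries keyed by 0 precede those keyed by 1, the reducer needs only a running total in memory, never the whole list.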

The Online-Aggregation is very scalable, straightforward,
and achieves excellent load balancing due to using combiners.
However, it assumes the shuffler sorts the reducer input
by the secondary keys. As discussed in § 2,
Hadoop provides no support for secondary keys, and the
workarounds are either unscalable, or entail rewriting parts
of the engine. Moreover, we could not find any published
instructions on how to use the combiners with the secondary 
keys workarounds in a scalable way. Next, we propose other 
scalable algorithms that can be executed on Hadoop, and 
compare the performance of all the algorithms in § 7. 

5.2 The Lookup Algorithm 

The Lookup algorithm consists of two steps. The first
Lookup$_1$ step computes $Uni(M_i)$ for each $M_i$. The mapper
outputs $f_{i,k}$ keyed by $M_i$ for each input tuple $\langle M_i, m_{i,k} \rangle$. The
reducers scan a reduce_value_list$_{M_i}$, and compute $Uni(M_i)$
for each $M_i$. The output of the reducers is a set of files mapping
each $M_i$ to its $Uni(M_i)$. Combiners are also used here to
improve the load balancing among reducers. The map and
reduce functions are formalized below.

$map_{Lookup_1}$:

$\langle M_i, m_{i,k} \rangle \xrightarrow{\text{if } f_{i,k} > 0} \langle M_i, f_{i,k} \rangle$

$reduce_{Lookup_1}$:

$\langle M_i, \langle f_{i,k} \rangle^* \rangle \rightarrow \langle M_i, \langle Uni(M_i) \rangle \rangle$

When a mapper of the second step, Lookup$_2$, starts, it
loads the files produced by Lookup$_1$ into a memory-resident
lookup hash table. As each Lookup$_2$ mapper scans an input
tuple, $\langle M_i, m_{i,k} \rangle$, it joins it to $Uni(M_i)$ using the lookup table.
The output of the mappers of Lookup$_2$ is the same as the
output of the mappers of Similarity$_1$. Hence, the Similarity$_1$
reducer can process the files output by the Lookup$_2$ mappers
directly. The map function is formalized below.

$map_{Lookup_2}$:

$\langle M_i, m_{i,k} \rangle \rightarrow \langle a_k, \langle M_i, Uni(M_i), f_{i,k} \rangle \rangle$
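
The two Lookup steps can be sketched in Python as follows, again assuming $Uni(M_i) = |M_i|$. The names `lookup1` and `lookup2` are hypothetical, and the lookup table is a plain dictionary standing in for the files loaded by each mapper.

```python
from collections import defaultdict

def lookup1(raw_tuples):
    """Lookup_1 sketch: compute Uni(M_i) = |M_i| for every multiset."""
    uni = defaultdict(int)
    for m, (a, f) in raw_tuples:
        if f > 0:
            uni[m] += f
    return dict(uni)

def lookup2(raw_tuples, table):
    """Lookup_2 mapper sketch: join each tuple to Uni(M_i) and re-key by element."""
    for m, (a, f) in raw_tuples:
        if f > 0:
            yield a, (m, table[m], f)

raw = [("M1", ("a", 2)), ("M1", ("b", 1)), ("M2", ("a", 1))]
table = lookup1(raw)
print(table, list(lookup2(raw, table)))
```

The table holds one entry per multiset, so its size grows with the number of multisets rather than with their cardinalities.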

The Lookup algorithm suffers from limited scalability. The 
second step assumes that the results of the first step can be 
loaded in memory to be used for lookups. If the memory 
cannot accommodate a lookup table with an entry for each
$M_i$, the Lookup$_2$ mappers suffer from thrashing. We next propose the
Sharding algorithm that avoids this scalability limitation.

5.3 The Sharding Algorithm

The Sharding algorithm is a hybrid one between Online- 
Aggregation and Lookup. It exploits the skew in the under- 
lying cardinalities of the multisets to separate the multisets 
into sharded and unsharded multisets. Sharded multisets 
have vast underlying cardinalities, are few in numbers, and 
are handled by multiple machines in a manner similar to 
Lookup without sacrificing scalability. Any unsharded mul- 
tiset can fit in memory, and is handled in a way similar to 
the Online-Aggregation algorithm. 

The Sharding algorithm consists of two steps. The first
Sharding$_1$ step is the same as Lookup$_1$, with one exception.
The reducer computes $Uni(M_i)$, and outputs a mapping
from $M_i$ to its $Uni(M_i)$ only for each multiset, $M_i$, whose
$|U(M_i)| > C$, for some parameter $C$. The map and reduce
functions are formalized below.

$map_{Sharding_1}$:

$\langle M_i, m_{i,k} \rangle \xrightarrow{\text{if } f_{i,k} > 0} \langle M_i, f_{i,k} \rangle$

$reduce_{Sharding_1}$:

$\langle M_i, \langle f_{i,k} \rangle^* \rangle \xrightarrow{\text{if } |U(M_i)| > C} \langle M_i, \langle Uni(M_i) \rangle \rangle$

At the beginning of Sharding_2, each mapper loads the
output of the Sharding_1 step to be used as a lookup table,
exactly as in Lookup_2. As each Sharding_2 mapper
scans an input tuple, $\langle M_i, m_{i,k} \rangle$, it joins it to $Uni(M_i)$
using the lookup table. If the join succeeds, it is established
that $|U(M_i)| > C$, and $M_i$ is a sharded multiset.
The mapper computes the fingerprint of $a_k$, and outputs
the joined tuple keyed by $\langle M_i, fingerprint(a_k) \rangle$. The goal
of adding $fingerprint(a_k)$ to the key is to distribute the
load randomly among all the reducers. If the join fails, it
is established that $|U(M_i)| \le C$, and hence, a list of all
the elements in $U(M_i)$ can fit in memory. In that case, the
joined tuple keyed by $\langle M_i, -1 \rangle$ is output. Since the second
entry in the key is always $-1$, all the elements from $M_i$
will be consumed by the same Sharding_2 reducer. Since
reduce_value_list$_{M_i}$ fits in memory, the reducer can compute
$Uni(M_i)$, and join it to the individual elements in $U(M_i)$.

A Sharding_2 reducer receives either a tuple with $Uni(M_i)$
joined in if $M_i$ is sharded, or a tuple with no joined $Uni(M_i)$
if $M_i$ is unsharded. If the tuple has the $Uni(M_i)$ information,
the reducer strips off the fingerprint, and outputs a
joined tuple for each element. If the tuple does not contain
$Uni(M_i)$, then $M_i$ is unsharded, and reduce_value_list$_{M_i}$ fits
in memory. The reducer loads reduce_value_list$_{M_i}$ in memory
and scans it twice: the first time to compute $Uni(M_i)$,
and the second time to output a joined tuple of the form
$\langle M_i, Uni(M_i), m_{i,k} \rangle$ for each element $a_k$ in $U(M_i)$. The map
and reduce functions are formalized below.

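$map_{Sharding_2}$:

$$\langle M_i, m_{i,k} \rangle \xrightarrow{\text{if } f_{i,k} > 0,\ \text{lookup}}
\begin{cases}
\langle \langle M_i, fingerprint(a_k) \rangle, \langle Uni(M_i), m_{i,k} \rangle \rangle & \text{if } M_i \in Lookup \\
\langle \langle M_i, -1 \rangle, \langle NULL, m_{i,k} \rangle \rangle & \text{if } M_i \notin Lookup
\end{cases}$$

$reduce_{Sharding_2}$:

$$\langle \langle M_i, fingerprint(a_k) \rangle, (\langle Uni(M_i), m_{i,k} \rangle)^* \rangle \rightarrow \langle M_i, Uni(M_i), m_{i,k} \rangle$$

$$\langle \langle M_i, -1 \rangle, (\langle NULL, m_{i,k} \rangle)^* \rangle \rightarrow \langle M_i, Uni(M_i), m_{i,k} \rangle$$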
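The following single-process sketch mirrors the two Sharding steps formalized above. It rests on several assumptions that are not part of the paper: each (multiset, element) pair appears once in the input, $Uni(M_i)$ is the sum of multiplicities, dictionaries stand in for the shuffle, and `fingerprint` is a hypothetical CRC-based bucket function.

```python
from collections import defaultdict
import zlib

def fingerprint(element, buckets=4):
    # Hypothetical fingerprint: a stable hash of the element into a few buckets.
    return zlib.crc32(element.encode("utf-8")) % buckets

def sharding1(records, C):
    """Sharding_1: emit Uni(M_i) only for multisets with |U(M_i)| > C."""
    total, distinct = defaultdict(int), defaultdict(int)
    for mid, _element, freq in records:
        if freq > 0:
            total[mid] += freq       # assumed Uni(M_i): sum of multiplicities
            distinct[mid] += 1       # underlying cardinality |U(M_i)|
    return {m: total[m] for m in total if distinct[m] > C}

def sharding2(records, table):
    """Sharding_2: map with the lookup table, then reduce per key."""
    groups = defaultdict(list)
    for mid, element, freq in records:               # map phase
        if freq <= 0:
            continue
        if mid in table:                              # sharded: spread over reducers
            groups[(mid, fingerprint(element))].append((table[mid], element, freq))
        else:                                         # unsharded: one reducer per M_i
            groups[(mid, -1)].append((None, element, freq))
    out = []
    for (mid, tag), values in groups.items():        # reduce phase
        if tag == -1:
            # First scan of the in-memory value list: compute Uni(M_i).
            local_uni = sum(f for _, _, f in values)
        for joined_uni, element, freq in values:      # output one tuple per element
            uni = local_uni if tag == -1 else joined_uni
            out.append((mid, uni, element, freq))
    return out

records = [("big", "a", 1), ("big", "b", 2), ("big", "c", 3),
           ("small", "a", 4), ("small", "d", 1)]
table = sharding1(records, C=2)       # only "big" has more than 2 distinct elements
result = sorted(sharding2(records, table))
```

Only the sharded multiset appears in `table`, so the lookup table stays small, while the unsharded multiset is aggregated inside a single reducer group.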

The Sharding algorithm is scalable, and is largely insensitive
to the parameter $C$, as shown in § 7. The main goal of
the parameter $C$ is to separate the very few multisets with
vast underlying cardinalities that cannot fit in memory from
the rest of the multisets. This separation of multisets is critical
for the scalability of the algorithm. Therefore, the use
of $C$ should not be nullified by setting $C$ to trivially large or
small values. Setting $C$ to a huge value stops this separation
of multisets into sharded and unsharded categories. In that
case, Sharding_1 reducers processing multisets with vast underlying
cardinalities would be overloaded, and would suffer
from thrashing. Conversely, setting $C$ to a trivially small
value transforms the algorithm into the Lookup algorithm, and
the Sharding_2 mappers will have to fit in memory a lookup
table mapping almost every $M_i$ to its $Uni(M_i)$.

For the three proposed algorithms, the slowest machine is
the reducer that handles the multiset with the largest underlying
cardinality. The I/O cost of this reducer is proportional
to $\max_i |U(M_i)|$. However, this slowness is greatly
reduced by using combiners. Dedicated combiners are used
in every aggregation to conserve the network bandwidth.
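As a rough illustration of why combiners help, the sketch below pre-aggregates the output of each mapper before it is shuffled. It assumes the aggregation is a sum, which is associative and commutative; the function names are hypothetical and not taken from any MapReduce API.

```python
from collections import defaultdict

def combine(mapper_output):
    """Combiner: pre-aggregate (M_i, f_ik) pairs locally on one mapper."""
    partial = defaultdict(int)
    for mid, freq in mapper_output:
        partial[mid] += freq
    return list(partial.items())

def reduce_sum(shuffled):
    """Reducer: merge the partial sums arriving from all mappers."""
    total = defaultdict(int)
    for mid, value in shuffled:
        total[mid] += value
    return dict(total)

mapper_a = [("ip1", 2), ("ip1", 5), ("ip2", 1)]
mapper_b = [("ip1", 3), ("ip2", 4)]

# With combiners, 4 tuples cross the network instead of 5.
shuffled = combine(mapper_a) + combine(mapper_b)
totals = reduce_sum(shuffled)
```

The saving grows with the skew: a multiset with a vast underlying cardinality sends at most one tuple per mapper to its reducer instead of one tuple per element.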

It is also worth noting that for any two measures that 
use the same <#(.,.) functions (e.g., Dice and cosine), the 
performance of the joining algorithms is little affected by 
using one over the other. 

Next, the related work is discussed with a special focus
on the VCL algorithm [33]. VCL is used as a baseline to
evaluate the performance and scalability of the proposed 
algorithms in § 7. 

6. RELATED WORK 

Related problems have been tackled in different applica- 
tions, programming paradigms, and using various similarity 
measures for sets, multisets, and vectors. This section starts
with a general review, and then discusses VCL in detail.

6.1 All-Pair Similarity Join Algorithms 

Several approximate sequential algorithms employ Local- 
ity Sensitive Hashing (LSH), whose key idea is to hash the 
elements of the sets so that collisions are proportional to 
their similarity [18]. An inverted index is built on the union 
of hashed elements in all the sets. The goal is to avoid the 
quadratic step of calculating the similarity between all sets 
unless it is absolutely necessary. 

Broder et al. proposed a sequential algorithm to estimate 
the Jaccard similarity between pairs of documents [5, 6] us- 
ing LSH. In [5, 6], each document is represented using a set, 
Si, comprising all its shingles, where a shingle is a fixed- 
length sequence of words in the document. A more scalable 
version of the algorithm is given in [22] in the context of de- 
tecting attacks from colluding attackers. The LSH process 
was repeated using several independent hash functions to 
establish probabilistic bounds on the errors in the similarity
estimates. While these algorithms considered sets only,
they can employ the set representation of multisets proposed
in [10] to estimate the generalized Ruzicka similarity.
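The idea of estimating the Jaccard similarity from hash collisions can be sketched with a generic MinHash-style signature. This is an illustration of the general technique, not the exact algorithm of [5, 6] or [22]: the SHA-1-based hash family, the number of hash functions, and the function names are all assumptions, and the input sets are assumed to be non-empty.

```python
import hashlib

def _h(seed, item):
    # One member of a family of hash functions, selected by `seed`.
    digest = hashlib.sha1(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def minhash_signature(items, num_hashes=200):
    """Signature: the minimum hash value of the set under each hash function."""
    return [min(_h(seed, x) for x in items) for seed in range(num_hashes)]

def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which the two sets collide."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)

def jaccard(a, b):
    """Exact Jaccard similarity, for comparison."""
    return len(a & b) / len(a | b)

a = set(range(0, 100))
b = set(range(50, 150))
estimate = estimate_jaccard(minhash_signature(a), minhash_signature(b))
exact = jaccard(a, b)   # 50 shared elements out of 150
```

Each signature position collides with probability equal to the Jaccard similarity, so using more hash functions tightens the estimate at the cost of longer signatures.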
Figure 2: The distribution of elements per multiset.

LSH was also used in [9] to approximate other similarity
measures such as the Earth Mover Distance (EMD) between
distributions 6 [28], and the cosine similarity between
sets. However, the estimated similarities have a multiplicative
bias that grows linearly with $\log(|A|) \log\log(|A|)$, which
might be impractical for large alphabets, such as cookies 7.

Using inverted indexes is proposed to solve the all-pair 
similarity join problem exactly in [29]. Instead of scanning 
the inverted index and generating all pairs of sets sharing an 
element, the algorithm in [29] proceeds in two phases. The 
first candidate generation phase scans the data, and for each 
set, Si, selects the inverted index entries that correspond to 
its elements. The algorithm then sorts the elements in this 
partial index by their frequency in order to exploit the skew 
in the frequencies of the elements. The algorithm dissects 
these elements into two partial indexes. The first partial 
index comprises the least frequent elements (i.e., elements 
with short lists of sets), and is denoted Prefix(Si). The 
second index comprises the most frequent elements (i.e., el- 
ements with long lists of sets), and is denoted Suffix(Si). 
The length of Suffix(Si) is determined based on $|S_i|$ and $t$,
such that the similarity between Si and any other set cannot
be established using only the elements in the suffix.
The candidate generation phase merges all the lists in the 
prefix and generates all the candidates that may be similar 

6 Given two piles of dirt in the shapes of the distributions, the 
distance measure is proportional to the effort to transform 
one pile into the other. 

7 [16] has reported that the bias factor grows linearly with $|A|$. In
another analysis [17], Henzinger reported that the algorithm 
in [9] is more accurate than the algorithm in [5, 6] on the ap- 
plication of detecting near-duplicate web pages when using 
the same fingerprint size. That is attributed to the ability of 
[9] to respect the repeated shingles in the documents. The 
number of independent hash functions used in [17] is 84. It 
is notable that this is significantly less than the number of 
hash functions proposed in [22] of 423 to guarantee an er- 
ror bound of 4% with confidence 95%. Clearly, [17] did not 
consider the set representation of multisets described in [10] . 



Figure 3: The distribution of multisets per element. 

to Si. In the second verification phase, the candidates are 
verified using the elements in the suffix. By dissecting the 
partial index of Si into a prefix and a suffix, the threshold 
t is exploited and the expensive step of generating all the 
candidates sharing any element in their suffixes is avoided. 

Several pruning techniques were proposed to further re- 
duce the number of candidates generated. One such promi- 
nent technique is prefix filtering [10, 4, 34]. The technique 
builds an inverted index only for the union of the prefix elements
of all the sets, which reduces the size of the inverted
indices by approximately $1 - t$, according to [34]. Similarly,
[34] proposed suffix filtering. In fact, [34] bundled prefix fil- 
tering and suffix filtering into a state of the art sequential 
algorithm, PPJoin+, along with positional filtering (the po- 
sitions of the elements in any pair of overlapping ordered 
sets can be used to upper bound their similarity), and size 
filtering [2] (similar sets have similar sizes from the pigeon- 
hole concept). Integrating most of these pruning techniques 
algorithmically was investigated in [19]. 
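Prefix filtering can be sketched for plain sets under the Jaccard measure as follows. This is a simplified sequential illustration, not PPJoin+ itself: it applies only the prefix filter, uses the standard Jaccard prefix length $|S| - \lceil t|S| \rceil + 1$, and orders elements by ascending global frequency. The function names are hypothetical.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations

def jaccard(a, b):
    return len(a & b) / len(a | b)

def prefix_filter_join(sets, t):
    """All pairs with Jaccard >= t, using prefix filtering to prune candidates."""
    freq = Counter(x for s in sets.values() for x in s)
    index = defaultdict(list)          # inverted index over prefix elements only
    for sid, s in sets.items():
        ordered = sorted(s, key=lambda x: (freq[x], x))    # rarest elements first
        # Prefix length for Jaccard; the epsilon guards against float round-up.
        plen = len(s) - math.ceil(t * len(s) - 1e-9) + 1
        for x in ordered[:plen]:
            index[x].append(sid)
    candidates = set()
    for ids in index.values():         # only sets sharing a prefix element pair up
        candidates.update(combinations(sorted(ids), 2))
    # Verification: compute the exact similarity for the surviving candidates.
    return {(a, b) for a, b in candidates if jaccard(sets[a], sets[b]) >= t}

sets = {
    "s1": {"a", "b", "c", "d"},
    "s2": {"a", "b", "c", "e"},
    "s3": {"x", "y", "z"},
    "s4": {"x", "y", "z", "w"},
}
similar = prefix_filter_join(sets, t=0.5)
```

Note that the sketch needs the global frequency table `freq` for the whole alphabet, which is the structure that must fit in memory when this idea is ported to a distributed setting.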

The MapReduce-based algorithm in [13] approximates the
multiset similarity using the vector cosine similarity. The
algorithm and the approximation are adopted in [3] with
optimizations borrowed from [4] to reduce the communication
between the machines and distribute the load more evenly.
These techniques represent multisets as unit vectors, which
ignores their cardinalities. This approximation allows for
devising simple MapReduce algorithms. However, these
techniques are not applicable when multisets are skewed in size,
and the sizes of the multisets are relevant, which is typical
in Internet-traffic applications. In addition, these techniques
provide approximate similarities, which defeats the purpose of
using the MapReduce framework, since it can crunch large
datasets to provide exact results.
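A small worked example shows what is lost when multisets are treated as unit vectors. The two hypothetical multisets below have the same proportions but very different sizes; the cosine similarity cannot tell them apart, while the Ruzicka similarity can. The data and function names are illustrative assumptions.

```python
import math

def ruzicka(m1, m2):
    """Sum of element-wise minima over sum of element-wise maxima."""
    keys = set(m1) | set(m2)
    num = sum(min(m1.get(k, 0), m2.get(k, 0)) for k in keys)
    den = sum(max(m1.get(k, 0), m2.get(k, 0)) for k in keys)
    return num / den

def cosine(m1, m2):
    """Cosine of the angle between the two multiplicity vectors."""
    dot = sum(f * m2.get(k, 0) for k, f in m1.items())
    norm1 = math.sqrt(sum(f * f for f in m1.values()))
    norm2 = math.sqrt(sum(f * f for f in m2.values()))
    return dot / (norm1 * norm2)

small = {"c1": 1, "c2": 2}
large = {"c1": 10, "c2": 20}    # same proportions, ten times the traffic

cos_value = cosine(small, large)     # close to 1: sizes are ignored
ruz_value = ruzicka(small, large)    # 3 / 30: sizes matter
```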

The PPJoin+ algorithm is adapted to a MapReduce setting
in [33] for database joins. Since this is the only algorithm
that is exact, distributed, and versatile, it is used as
a benchmark and is explained in detail next.
6.2 The VCL Algorithm

The VCL algorithm 8 was devised for set similarity joins 
where the sets come from two different sources. The algo- 
rithm was also adapted to solve the all-pair similarity join 
problem where the sets come from the same source, which 
is the problem at hand. While the work in [33] targets sets,
it is applicable to multisets and vectors. 

VCL is a MapReduce adaptation of PPJoin+ proposed in 
[34] that reduces the number of candidate pairs by combin- 
ing several optimizations. In fact, the main MapReduce step 
of VCL relies on prefix filtering, explained in § 6.1. To apply
the candidate pair filtering technique of [34], VCL makes a
preprocessing scan on the dataset to sort the elements of the
alphabet, A, by frequency. During the initialization of the 
mappers of the main phase, all the elements, sorted by their 
frequencies, are loaded into the memory of the mappers. 

Each mapper processes a multiset at a time, and each
multiset is processed by one mapper. For each multiset, $M_i$,
the mapper computes the prefix elements of $M_i$, and outputs
the entire content of $M_i$ with each element $a_k \in Prefix(M_i)$.
VCL uses the MapReduce shuffle stage to group together
multisets that share any prefix element. Hence, each reducer
receives a reduce key, element $a_k$, along with the
reduce_value_list$_{a_k}$ comprising all the multisets for which $a_k$ is
a prefix element. For each multiset in the reduce_value_list$_{a_k}$,
the reducer has the elements of the entire multiset, and
can compute the similarity between each pair of multisets.
This algorithm computes the similarity of any two multisets
on each reducer processing any of their common prefix
elements. These similarities are deduplicated in a post-processing
phase. The map and reduce functions of the kernel,
i.e., main, phase are formalized below.

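$map_{VCL}$:

$$\langle M_i, \{m_{i,1}, \ldots, m_{i,|A|}\} \rangle \xrightarrow{\forall a_k \in Prefix(M_i)} (\langle a_k, \langle M_i, \{m_{i,1}, \ldots, m_{i,|A|}\} \rangle \rangle)^*$$

$reduce_{VCL}$:

$$\langle a_k, (\langle M_i, \{m_{i,1}, \ldots, m_{i,|A|}\} \rangle)^* \rangle \xrightarrow{\forall M_i, M_j \in reduce\_value\_list} (\langle M_i, M_j, Sim(M_i, M_j) \rangle)^*$$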

VCL suffers from major inefficiencies in computation,
network bandwidth, and storage. For each multiset, $M_i$, the
map stage incurs a network bandwidth and storage cost that
is proportional to $|Prefix(M_i)| \times |U(M_i)|$. Hence, the map
bottleneck is the mapper handling the largest multiset. This
constituted a major bottleneck in the reported experiments.
In addition, the reducers suffer from high redundancy. Each
pair of multisets, $M_i$ and $M_j$, has its similarity computed
$|Prefix(M_i) \cap Prefix(M_j)|$ times. This inefficiency cannot be
alleviated using combiners.
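The two costs described above can be made concrete with a small simulation of the VCL kernel step. The sketch assumes the prefixes are already computed and passed in, counts replicated elements as a proxy for network and storage cost, and uses hypothetical names throughout; it is not the code of [33].

```python
from collections import Counter, defaultdict
from itertools import combinations

def vcl_kernel(multisets, prefixes):
    """Simulate the VCL kernel step and count the work it performs.

    multisets: {id: {element: multiplicity}}
    prefixes:  {id: set of prefix elements}, assumed to be given.
    """
    shuffle = defaultdict(list)
    replicated = 0
    for mid, content in multisets.items():           # map
        for a_k in prefixes[mid]:
            shuffle[a_k].append(mid)                  # whole multiset travels with a_k
            replicated += len(content)
    computed = Counter()
    for a_k, ids in shuffle.items():                 # reduce
        for pair in combinations(sorted(ids), 2):
            computed[pair] += 1                       # Sim(M_i, M_j) evaluated again
    return replicated, computed

multisets = {
    "m1": {"a": 1, "b": 2, "c": 1},
    "m2": {"a": 3, "b": 1, "d": 1},
}
prefixes = {"m1": {"a", "b"}, "m2": {"a", "b"}}
replicated, computed = vcl_kernel(multisets, prefixes)
```

Here each multiset of three elements is shipped twice, and the pair is evaluated once per shared prefix element, which is the redundancy that the post-processing phase later has to deduplicate.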

To reduce this inefficiency, grouping of elements into super- 
elements was proposed in [33]. Representing multisets in 
terms of super-elements shrinks the multisets, and hence 
reduces the network, memory, and disk footprint. Group- 
ing elements shrinks the alphabet, and hence a list of the 
super-elements, sorted by their frequencies, can be more 
easily accommodated in the memories of the VCL kernel 



8 The algorithm is referred to as VCL after the names of the 
authors of [33]. 



mappers. In addition, grouping reduces the number of ker- 
nel reducers calculating the similarity of pairs of multisets. 
The kernel reducers produce a candidate pair of multisets 
if their similarity of super-elements exceeds the threshold, 
t. Grouping produces "superfluous" pairs of multisets that
share a prefix super-element without sharing a prefix
element. These superfluous pairs are weeded out in the
post-processing phase. In the experiments in [33], grouping was
shown to consistently introduce more overhead than savings
due to the superfluous pairs, and the authors suggested using
one element per group. With one element per group, the alphabet
has to fit completely in the memory of the mappers, which renders
the VCL algorithm incapable of handling very large alphabets.

The VCL algorithm suffers from another major scalabil- 
ity bottleneck. In the map function of the kernel phase and 
the post-processing phase, entire multisets are read, pro- 
cessed, and output as whole indivisible capsules of data. 
Hence, VCL can only handle multisets that can fit in memory.
This renders the algorithm incapable of handling
Internet-traffic-scale applications, where the alphabet could
be the cookies visiting Google, and the multisets could be 
the IPs visiting Google with these cookies. 

7. EXPERIMENTAL RESULTS 

To establish the scalability and efficiency of the V-SMART- 
Join algorithms, experiments were carried out with datasets 
of real IPs and cookies. Each IP was represented as a multi- 
set of cookies, where the multiplicity is the number of times 
the cookie appeared with an IP. The similarity measure used 
was Ruzicka. The experiments were conducted using two 
datasets from the search query logs. The first dataset is
much smaller, and had approximately 133 Million
unique elements (cookies) shared by approximately 82 Million
multisets (IPs). The first dataset was chosen so that all
the algorithms could finish processing it. This smaller dataset
was used as a litmus test to determine which algorithms would be
compared on the second dataset.
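For reference, the Ruzicka similarity used in these experiments can be written as below. The notation is an assumption made for this illustration: $f_{i,k}$ denotes the multiplicity of element $a_k$ in $M_i$, and $|M_i| = \sum_k f_{i,k}$ denotes the total multiplicity of $M_i$. The second equality follows from $\min(x, y) + \max(x, y) = x + y$.

```latex
\[
Ruzicka(M_i, M_j)
  = \frac{\sum_{k} \min(f_{i,k}, f_{j,k})}{\sum_{k} \max(f_{i,k}, f_{j,k})}
  = \frac{\sum_{k} \min(f_{i,k}, f_{j,k})}
         {|M_i| + |M_j| - \sum_{k} \min(f_{i,k}, f_{j,k})}
\]
```

In the second form, the sum of minima is non-zero only for cookies seen by both IPs, so the similarity can be computed from the per-IP totals and a scan of the shared cookies alone.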

The second dataset is of a more realistic size, and is used
to determine which algorithms can solve the all-pair similarity
join problem in an Internet-traffic-scale setting, and to compare
their efficiency. The second dataset had approximately
2.2 Billion unique elements (cookies) shared by approxi- 
mately 454 Million multisets (IPs). The distributions of the 
multisets and elements are plotted in Fig. 2 and Fig. 3. 

Clearly, both the multisets (the IPs) and the alphabet (the
cookies) number in the hundreds of millions to billions.
In addition, the distributions are fairly skewed. However, no 
stop words were discarded, and no multisets were sampled. 

The algorithms analyzed in this experimental evaluation 
are the proposed algorithms as well as the state of the art 
algorithm, VCL. We did not include the LSH-based algo- 
rithms since the existing algorithms are serial, and general- 
izing them to a distributed setting is beyond the scope of 
this work. In addition, LSH algorithms are approximate. 
Using the computing power of multiple machines in a parallel
setting obviates the need for approximation, especially
if the exact algorithms can finish within reasonable time. 

All the algorithms were allowed 1GB of memory, and 
10GB of disk space on each of the machines they ran on, 
and they all ran on the same number of machines. All the 
algorithms were started concurrently to factor out any mea- 
surement biases caused by the data center loads. All the 
reported run times represent a median-of-5 measurements. 
Figure 4: Algorithms run time on the small dataset
with various similarity thresholds (500 machines). 

The results of comparing the algorithms on the small and 
realistic datasets are reported in § 7.1 and § 7.2, respectively. 
We also conduct a sensitivity analysis of the Sharding algo- 
rithm with respect to the parameter C in § 7.3. Finally, we 
briefly comment on discovering load balancers in § 7.4. 

7.1 Algorithms Comparison on the 
Small Dataset 

The first step in comparing the algorithms on the small 
dataset was to run each algorithm on the same number of 
machines, 500, and to vary the similarity threshold, t, between
0.1 and 0.9 in steps of 0.1. Understandably, all
the algorithms produced the same number of similar pairs 
of IPs for each value of t. The results are plotted in Fig. 4. 
Clearly, the performance of the VCL algorithm in terms of 
run time was not close to any of the V-SMART-Join algo- 
rithms. In addition, its performance was highly dependent 
on the similarity threshold, t. It is also worth mentioning 
that at least 86% of the run time of VCL was consumed 
by the map phase of the kernel MapReduce step, where the 
multisets get replicated for each prefix element. The V- 
SMART-Join algorithms were fairly insensitive to t. Their 
run time decreased very slightly as t increased, since fewer
pairs were output, which reduces the I/O time.

The Online-Aggregation algorithm was consistently the
most efficient. Online-Aggregation executed 30 times faster
than VCL when the similarity threshold was 0.1. When
the threshold was increased to 0.9, the performance of VCL
improved to be only 5 times worse than Online-Aggregation.
Online-Aggregation was followed by Lookup, and then Sharding,
with slight differences in performance. This was expected,
since the Online-Aggregation joining needs only one
MapReduce step. The Lookup algorithm saves a MapReduce
step compared to the Sharding algorithm.

How the algorithms scale out relative to the number of 
machines was also examined. All the algorithms were run 
to find all pairs of similarity 0.5 or more, and the number 



Figure 5: Algorithms run time on the small dataset 
with various numbers of machines (t = 0.5). 

of machines was varied from 100 to 900 in steps of
100 machines. Again, the VCL algorithm performed much
worse than the V-SMART-Join algorithms. In addition,
when VCL ran on over 500 machines, it did not
benefit much from the extra machines. The reason is that the
bottleneck of the runs was outputting each large multiset
with each one of its prefix elements. This results in a huge
load imbalance. That is, the machines that handle
the large multisets become very slow, regardless
of the number of machines used. When using 900 machines
instead of 100 machines, VCL run time dropped by 35%.

On the other hand, the V-SMART-Join algorithms con- 
tinued to observe a relative reduction in the run time as 
more machines were used. This speed up was hampered by 
the fact that a large portion of the run times were spent 
in starting and stopping the MapReduce runs. The algo- 
rithm that exhibited the most reduction in run time was 
Online-Aggregation, whose run time dropped by 53%, while 
the Lookup showed the least reduction in run time with a 
drop of 32%. This is because part of the run time of Lookup 
was loading the lookup table mapping each Mi to Uni(Mi) 
on each machine, which is a fixed overhead regardless of 
the number of machines used. Again, Online-Aggregation 
outperformed VCL by 11 to 15 times depending on the sim- 
ilarity threshold. 

7.2 Algorithms Comparison on the 
Realistic Dataset 

The algorithms were run on the more realistic dataset, and 
the results are presented below. It is worth mentioning that 
Lookup did not succeed because it was never able to load the 
entire lookup table mapping each Mi to Uni(Mi). Hence,
Lookup was out of the competition. Similarly, the VCL al- 
gorithm was not able to load all the cookies, sorted by their 
frequency. To remedy this, the cookie elements were sorted 
based on their hash signature instead of their frequencies. 
However, even with this modification, VCL never finished 
the runs within two days. The mappers of the kernel step
took more than 48 hours to finish, and were killed by the 
MapReduce scheduler. 

The remaining algorithms, Online-Aggregation and Shard- 
ing, were compared. The similarity phase is common to both 
algorithms. Hence, the time for running the joining phase 
was measured separately from the time for running the sim- 
ilarity phase. Since these algorithms do not get affected by 
the similarity threshold, only their scaling out with the num- 
ber of machines was compared. The algorithms were run to 
find all pairs of similarity 0.5 or more, and the number of 
machines was varied from 100 to 900 in steps of 100
machines. The results are plotted in Fig. 6. As the figure
shows, both algorithms, as well as the common similarity step,
were able to scale out as the number of machines increased.
Online-Aggregation took roughly half the time of Sharding. 

7.3 How Sensitive is Sharding to C? 

The previous section shows that while the Sharding algorithm
takes roughly twice as long as the Online-Aggregation
algorithm, it is still scalable. The main advantage of Sharding
is that it does not use secondary keys, which are not supported
natively by Hadoop. On the other hand, Sharding takes
a parameter $C$. The function of parameter $C$ is to separate
the multisets with vast underlying cardinalities, whose
$Uni(.)$ values are calculated and loaded in memory as
the Sharding_2 mappers start, from the rest of the multisets,
whose $Uni(.)$ values are calculated on the fly by the Sharding_2
reducers. A sensitivity analysis was conducted on the performance
of the Sharding algorithm as the parameter $C$ was
varied. The run times of the Sharding_1 and Sharding_2 steps,
as well as their sum, are plotted in Fig. 7 as the parameter
$C$ is varied between $2^5$ and $2^{15}$ using exponential steps.

The run time of the Sharding_1 step decreased since fewer
pairs were output as $C$ increased, which reduced the I/O
time. On the other hand, the run time of the Sharding_2
step increased since more on-the-fly aggregation is done as
$C$ increased. The total run time of the Sharding algorithm
stayed stable throughout the entire range of $C$. More precisely,
the total run time had a slight downward trend until the
value of $C$ was roughly 1000 and then increased again. Notice,
however, that larger values of $C$ reduce the memory footprint
of the algorithm, and are therefore recommended.

7.4 A Comment on Identifying Proxies 

We conclude the experimental section by briefly discussing 
the discovered IP communities. For each similarity thresh- 
old, a manual analysis was done on a random sample of the 
similar IPs. Each threshold was judged based on its cover- 
age, i.e., the number of discovered similar IPs, and the false 
positives of the sample. False positives are defined as IPs in 
the results that cannot be proxies. Similar IPs are judged
not to be proxies based on evidence independent of this study.
An example is the case when two IPs that were judged by 
this approach to be similar belong in fact to two different 
organizations. Clearly, setting t to 0.1 yields the highest 
coverage, but also the highest false positives. 

To reduce the false positives, instead of raising the similarity
threshold, IPs that observed fewer than 50 cookies were
filtered out. This almost eliminated the false positives for all
the thresholds, since it eliminated all the IPs that have a very
low chance of acting as proxies. After eliminating these IPs,
the number of cookies was around two orders of magnitude



larger than the number of IPs. It is expected to find a lot 
more cookies than IPs in proxy settings. 
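The filtering step can be sketched in a few lines. The sketch assumes that "observed fewer than 50 cookies" refers to distinct cookies, which the text does not state explicitly, and the function name and sample data are hypothetical.

```python
def filter_small_ips(multisets, min_cookies=50):
    """Keep only IPs that observed at least `min_cookies` distinct cookies.

    multisets: {ip: {cookie: multiplicity}}
    Assumption: the cutoff applies to distinct cookies, not total multiplicity.
    """
    return {ip: cookies for ip, cookies in multisets.items()
            if len(cookies) >= min_cookies}

ips = {
    "home": {f"c{i}": 1 for i in range(3)},      # a few cookies: unlikely proxy
    "proxy": {f"c{i}": 2 for i in range(60)},    # many cookies: proxy candidate
}
kept = filter_small_ips(ips)
```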

Notice that this filtering of small IPs would not improve
the reported performance of VCL, though it would improve
the reported performance of Lookup. The reason is that the main
bottleneck of VCL is the multisets with vast underlying
cardinalities. These bottleneck multisets are the most important
to identify in order to discover load balancers, and should not
be filtered out. On the other hand, by reducing the number
of multisets, the Lookup algorithm reduces the I/O time
of $reduce_{Lookup_1}$, which is responsible for producing the data for the
lookup table mapping each Mi to Uni(Mi). It is also worth
noting that this filtering allowed the Lookup algorithm to
accommodate the lookup table of the realistic dataset, and
to finish the run in a time very comparable to that of the
Online-Aggregation algorithm.

The overwhelming majority of the discovered load bal- 
ancers were in European countries. The seven largest strongly 
connected sets of IPs spanned several subnetworks, and com- 
prised thousands of IPs. The load balancers in Saudi Arabia 
and North Korea were few, but were the most active. 

8. DISCUSSION 

The V-SMART-Join MapReduce-based framework for dis- 
covering all pairs of similar entities is proposed. This work 
presents a classification of the partial results necessary for 
calculating Nominal Similarity Measures (NSMs) that are 
typically used with sets, multisets, and vectors. This clas- 
sification enables splitting the V-SMART-Join algorithms 
into two stages. The first stage computes and joins the par- 
tial results, and the second stage computes the similarity for 
all candidate pairs. The V-SMART-Join algorithms were 
up to 30 times as efficient as the state of the art algorithm, 
VCL, when compared on real small datasets. We also estab- 
lished the scalability of the V-SMART-Join algorithms by 
running them on a dataset of a realistic size, on which the
VCL mappers never finished, not even when VCL
was modified to improve scalability.

We touch on the reason why we did not incorporate prefix 
filtering into the proposed algorithms. While prefix filtering 
reduces the generated candidates from any pair of multisets 
sharing an element to only those that share a prefix ele- 
ment, employing it in a MapReduce algorithm introduces 
a scalability bottleneck, which defeats the purpose of using 
MapReduce. First, loading a list of all the alphabet ele- 
ments, sorted by their frequencies, in memory to identify 
the prefix elements of each entity renders prefix filtering in- 
appropriate for handling extremely large alphabets. This 
was a bottleneck for the algorithms in [3, 33]. Extremely 
large alphabets and entities are common in Internet-traffic- 
scale applications. While [33] proposed grouping elements 
to reduce the memory footprint of prefix filtering, their ex- 
periments showed the inefficiencies introduced by grouping. 
Second, the approach of generating candidates and then ver- 
ifying them entails machines loading complete multisets as 
indivisible capsules. This limits the algorithms in [3, 33] to 
datasets where pairs of multisets can fit in memory. Finally, 
as is clear from the experiments, prefix filtering is only effective
when the similarity threshold is extremely high. Prefix
filtering becomes less effective when the similarity thresh- 
old drops. As was clear from our application, the threshold 
was set to a small value (0.1) to find all similar IPs, which 
minimizes the benefits of prefix filtering. 
Figure 6: Algorithms run time on the large dataset
with various numbers of machines (t = 0.5). 

The main lesson learned from this work is that devising
new algorithms for the MapReduce setting may yield algorithms
that are more efficient and scalable than those devised
by adapting sequential algorithms to this distributed
setting. Adapting sequential algorithms to distributed
settings may overlook capabilities and functionalities offered
by the MapReduce framework. It is also crucial to devise
algorithms that are compatible with the publicly available
version of MapReduce, Hadoop, for wider adoption.

Finally, it is constructive to identify the limitations of 
this work. The proposed algorithms, as well as others in the
literature, handle only NSMs whose partial results can be
computed either by scanning the two entities, or by scanning
the intersection of the two entities. That is, the algorithms
do not handle an NSM if any of its partial results entails scanning
the elements in the union of the two entities. This
still makes this work applicable to a large array of similarity 
measures, such as Jaccard, Ruzicka, Dice, and cosine. 
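The sketch below illustrates this decomposition for three measures: each similarity is assembled from per-entity partial results and from partial results computed over the shared elements only, so the union is never scanned. The function names, the dictionary keys, and the multiset generalization of Dice used here are assumptions made for illustration, and multiplicities are assumed to be positive.

```python
import math

def unary(m):
    """Per-entity partial results: computed by scanning one entity alone."""
    return {"size": sum(m.values()), "sq_norm": sum(f * f for f in m.values())}

def conjunctive(m1, m2):
    """Pairwise partial results: computed over the shared elements only."""
    shared = m1.keys() & m2.keys()
    return {
        "min_sum": sum(min(m1[k], m2[k]) for k in shared),
        "dot": sum(m1[k] * m2[k] for k in shared),
    }

def similarities(m1, m2):
    """Combine the partial results into the final similarity values."""
    u1, u2, c = unary(m1), unary(m2), conjunctive(m1, m2)
    return {
        "ruzicka": c["min_sum"] / (u1["size"] + u2["size"] - c["min_sum"]),
        "dice": 2 * c["min_sum"] / (u1["size"] + u2["size"]),
        "cosine": c["dot"] / math.sqrt(u1["sq_norm"] * u2["sq_norm"]),
    }

m1 = {"a": 2, "b": 1}
m2 = {"a": 1, "c": 3}
sims = similarities(m1, m2)
```

A measure whose numerator needed, for example, a term for every element present in either entity but not expressible through the per-entity totals would fall outside this scheme.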

In addition, this work assumes large scale datasets with 
numerous entities, numerous elements, and a skew in the 
sizes of the entities. The skew in the sizes of the entities 
enabled the sharding algorithm to categorize entities into 
sharded and unsharded entities. This work is not applicable 
to datasets with numerous entities and very few elements. 
For instance, if the entities represent distribution histograms 
of a moderate number of bins, and the elements represent the
bins, almost every bin would be shared by almost all the
entities. In that case, the algorithm would have to do an
exhaustive pairwise similarity join, which does not scale.
Our future work focuses on devising a MapReduce-based 
algorithm for all-pair similarity joins of histograms. 
Figure 7: The run time of Sharding on the large
dataset with various values of the parameter C. 